
Add initial LoRA finetuning support; vulkan OUT_PROD; vulkan cross-entropy-backward #5


Open: makaveli10 wants to merge 17 commits into temp-finetuning from lora-finetuning

Conversation

makaveli10

The PR adds:

  • LoRA finetuning support for both training a new adapter and finetuning an existing one. The adapter is saved at the end of the training run so it can be used for inference.
  • cuda: OUT_PROD Q8/Q4 for quantised LoRA finetuning.
  • vulkan: Added the OUT_PROD operator for fp32 to enable finetuning, plus OUT_PROD Q8/Q4 to enable quantised finetuning.
  • vulkan: Added cross-entropy-loss-backward to allow a lower context size, which is critical for training on mobile devices due to memory constraints (a conceptual sketch of both operators follows below).
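For readers unfamiliar with the two operators, here is a minimal CPU-side sketch of what they compute. The function names, signatures, and layouts are illustrative assumptions for this description only, not the actual ggml/CUDA/Vulkan kernels in the PR.

#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative OUT_PROD: dst[i][j] += sum_k a[i][k] * b[j][k], i.e. an
// accumulation of outer products over the shared dimension k. This is the
// kind of product a LoRA backward pass uses to build weight gradients from
// activations and output gradients. Row-major layout; the caller is assumed
// to have zeroed dst before the first call.
static void out_prod_f32(const float * a, const float * b, float * dst,
                         int rows_a, int rows_b, int cols) {
    for (int i = 0; i < rows_a; ++i) {
        for (int j = 0; j < rows_b; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < cols; ++k) {
                sum += a[i * cols + k] * b[j * cols + k];
            }
            dst[i * rows_b + j] += sum;
        }
    }
}

// Illustrative cross-entropy-loss-backward for a single token:
// grad[i] = d_loss * (softmax(logits)[i] - 1{i == target}).
static void cross_entropy_loss_back(const float * logits, int target,
                                    float d_loss, float * grad, int n) {
    float max_logit = logits[0];
    for (int i = 1; i < n; ++i) max_logit = std::max(max_logit, logits[i]);
    std::vector<float> p(n);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) { p[i] = std::exp(logits[i] - max_logit); sum += p[i]; }
    for (int i = 0; i < n; ++i) {
        grad[i] = d_loss * (p[i] / sum - (i == target ? 1.0f : 0.0f));
    }
}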


zoq commented Aug 19, 2025

Steps to build and test llama.cpp on Android:

  1. Install Termux from the Play Store and open it.
  2. Run apt update
  3. Run apt remove vulkan-loader-generic
  4. Run apt install git cmake vulkan-tools vulkan-headers shaderc vulkan-loader-android
  5. Run vulkaninfo --summary. This should show the driver and GPU information. If it's the stock driver, it shouldn't mention Mesa.
  6. Clone the repo inside Termux, cd into it, and make sure to check out the lora-finetuning branch:

git clone https://github.com/makaveli10/qvac-ext-lib-llama.cpp.git
cd qvac-ext-lib-llama.cpp
git checkout lora-finetuning

  7. Configure the Vulkan backend build with cmake -B build -DGGML_VULKAN=1
  8. Build it with cmake --build build --config Debug -j2
  9. Run termux-setup-storage and grant storage permissions to Termux.
  10. Outside Termux, download a model onto the phone from https://huggingface.co/prithivMLmods/Qwen3-0.6B-GGUF/tree/main (i.e. Qwen3_0.6B.Q8_0.gguf), then tap the file and select to open it with Termux.
  11. Click "Open Directory" on the prompt.
  12. The model should now be reachable inside Termux in the ~/downloads directory.
  13. For finetuning the 8-bit Qwen model:

./build/bin/llama-finetune-lora -m Qwen3_0.6B.Q8_0.gguf -f trump.txt -c 256 -b 256 -ub 256 -ngl 999

trump.txt dataset: https://github.com/user-attachments/files/21859494/trump.txt


zoq commented Aug 19, 2025

Command we used for testing the trained adapter:

./build/bin/llama-cli -m Qwen3_0.6B.Q8_0.gguf --lora trained-lora-adapter.gguf -if -p "What is your favorite pokemon?" -ngl 999


andrunko left a comment


Changes LGTM in general, just some small comments/nits overall, feel free to ignore the nitpicks :).

device->device.destroyBuffer(buffer);
device->device.freeMemory(device_memory);


The change looks good, but the commit message should be updated to remove "wip". It would also be good to explain in the commit message what specific crash this fixes.

@@ -93,4 +93,4 @@ int main(int argc, char ** argv) {
llama_backend_free();

return 0;
}
}


nit: nothing changed, I'd drop it from the commit.

@@ -3202,7 +3202,8 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
}
} break;
case GGML_OP_OUT_PROD:
return op->type == GGML_TYPE_F32 && op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32;
// return op->type == GGML_TYPE_F32 && op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32;


Any reason to keep this prev code? We can always check git history if we need to revert.
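For reference, a hedged sketch of what the support check reduces to once the commented-out line is dropped. The exact set of accepted quantised types here is an assumption based on the PR description (OUT_PROD Q8/Q4), not the actual diff on the branch.

#include "ggml.h"

// Hypothetical helper mirroring the expanded GGML_OP_OUT_PROD support check;
// the real type list in the PR may differ.
static bool out_prod_supported(const struct ggml_tensor * op) {
    const enum ggml_type s0 = op->src[0]->type;
    return op->type == GGML_TYPE_F32 &&
           (s0 == GGML_TYPE_F32 || s0 == GGML_TYPE_Q8_0 || s0 == GGML_TYPE_Q4_0) &&
           op->src[1]->type == GGML_TYPE_F32;
}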

const float * src0_d = (const float *) src0->data;
const float * src1_d = (const float *) src1->data;
// const float * src0_d = (const float *) src0->data;
// const float * src1_d = (const float *) src1->data;


Same here and in other places, I would drop the old code in general.


if (allocated_src0) {
CUDA_CHECK(cudaFreeAsync(src0_f32, stream));
// printf("DEBUG: Freed dequantized src0 buffer\n");


nit: while here I would also remove these leftover debug prints - here and in other similar places.

case GGML_OP_ADD:
case GGML_OP_SUB:
case GGML_OP_MUL:
case GGML_OP_DIV:
return (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16) &&
return (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16) &&

andrunko, Aug 21, 2025


nit: spurious change?


andrunko commented Aug 21, 2025

Looks like there are some CI failures also related to these changes - see https://github.com/tetherto/qvac-ext-lib-llama.cpp/actions/runs/17076253696/job/48418341198?pr=5 for example:

/__w/qvac-ext-lib-llama.cpp/qvac-ext-lib-llama.cpp/src/llama-lora-training.cpp:293:29: error: the address of 'ggml_tensor::name' will never be NULL [-Werror=address]
  293 |     if (!tensor || !tensor->name) {
      |                     ~~~~~~~~^~~~
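For context, the warning is correct: ggml_tensor::name is a fixed-size char array (char name[GGML_MAX_NAME]), so tensor->name can never be a null pointer, the old check is dead code, and -Werror=address turns it into a build failure. A typical fix is to test for an empty name instead; the sketch below is illustrative, not necessarily the exact change applied on the branch.

#include "ggml.h"

// Illustrative replacement for `!tensor || !tensor->name`: since `name` is an
// inline array its address is always non-NULL, so check for an empty string.
static bool tensor_is_unnamed(const struct ggml_tensor * tensor) {
    return tensor == nullptr || tensor->name[0] == '\0';
}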


JamieBohannaWebDev commented Aug 22, 2025

Fine-tuning attempt on Pixel 9 Pro Fold; evidence in the screenshots attached below.

Please note the 27.5-hour estimated completion time...

[four screenshots attached]
